<!DOCTYPE html>
<html lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Improving GCC's Infrastructure (Central Flow Graph)</title>
<link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
</head>
<body>
<h1>Improving GCC's Infrastructure
(Control Flow Graph)</h1>

<p>This page describes ongoing work to improve GCC's infrastructure
for RTL based optimizers. The work is done on a branch in GCC's CVS
tree called <code>cfg-branch</code>, which is available to experiment
in. As the patches stabilize, they will be submitted for review and
acceptance into the mainline development tree. Please contact <a
href="mailto:jh@suse.cz">Jan Hubicka &lt;jh@suse.cz&gt;</a> if you
would like to contribute.</p>

<h2>Background &amp; Rationale</h2>

<p>The purpose of work on the CFG branch is to update GCC's older
optimization passes to use the modern infrastructure which was not
available when they were written.  A second purpose is to continue
improving the infrastructure for RTL-based optimizations.</p>

<p>Currently, work concentrates on the control flow graph (CFG) data
structure, originally contributed by Richard Henderson.  We would like
to convert all of the RTL-based optimizer passes to use it and keep it
accurate.  This will allow us to store more information directly in
this structure, instead of using "notes" in the instruction stream
which are difficult to keep accurate during optimization.</p>

<h2>Goals</h2>

<p>In addition to improving the infrastructure and updating the
existing RTL-based optimization passes, we are implementing new
optimization passes.  We plan to implement optimizations that perform
code duplication first such as superblock formation and loop
peeling. We also plan to reimplement the loop unrolling pass.</p>

<h2>Detailed Description</h2>

<p>Some ideas that the cfg-branch will implement are known only in the
compiler research literature but are new concepts for GCC.  We'd like
to explain the concepts briefly, give an overview of what we want to
achieve in GCC and describe the current status.</p>

<h3>Loop Peeling</h3>

<p>This work is done by <a href="mailto:jh@suse.cz">Jan Hubicka</a>.</p>

<h4>Theory</h4>

<p><em>Loop Peeling</em> is a form of loop unrolling where the
<em>first</em> iterations from a loop are unrolled.  It removes
<code>i</code> iterations from the beginning of a loop and adds
<code>i</code> copies of the loop body directly before the start of
the loop.</p>

<h4>Implementation in GCC</h4>

<p>The loop peeling optimization will be done as part of the
superblock formation.  The tracer should peel as many iterations as
can be predicted.  Zdenek has also implemented simple loop peeling
as part of the new loop optimizer.
</p>

<h4>Status</h4>

<p>The code is on the branch.</p>

<p>The tracer peels loops with one predicted iteration. We should try
to peel more than one iteration.</p>

<h3>Loop optimizer re-implementation</h3>

<p>This work is done by <a href=
"mailto:rakdver@atrey.karlin.mff.cuni.cz">Zdenek Dvorak</a>.</p>

<h4>Theory</h4>

<p>The current loop optimizer uses information passed by the front end
to discover loop constructs to simplify flow analysis.
It is difficult to keep the information up-to-date and nowday
it is easy to implement the loop discovery code on CFG.
</p>

<h4>Implementation in GCC</h4>

<p>
We want to use the Michael Hayes' loop discovery code and slowly
replace existing features of loop optimizer by new one.
So far Zdenek has modified the datastructure to allow easier
updating and implemented loop unrolling, peeling and unswitching
on the top of it.
</p>

<h4>Status</h4>

<p>The main changes the loop infrastructure has been merged to mainline.
Rest is being tunned on the branch.</p>

<h3>Software Trace Cache</h3>

<p>This work is done by <a
href="mailto:zlomj9am@artax.karlin.mff.cuni.cz">Josef Zlomek</a>.</p>

<h4>Theory</h4>

<p><em>Software Trace Cache</em> is a basic block layout algorithm
that also does code duplication during the process.  While optimal
code layout is an NP complete problem, allowing a limited amount of
duplication should eliminate the common drawbacks of current
algorithms and keep code size under control.  The purpose of this pass
is to minimize the number of branches and cache misses.  It uses code
duplication to avoid jumps.  A trivial example is copying the return
instruction instead of jumping to it.  See <a href="#ref6">[6]</a> for a
detailed description.</p>

<h4>Implementation in GCC</h4>

<p>The implementation will only do intraprocedural optimizations since
interprocedural optimizations are out of reach of current GCC.</p>

<h4>Status</h4>

<p>The code is in the branch. The exact benefits needs to be measured
but on non-PDO runs it brings 5% to eon benchmark (from CPU2000).</p>

<h3>Web Construction Pass</h3>

<p>This pass constructs webs as commonly used for register allocation
purposes.  After that each web gets assigned individual pseudo.  This
allows our register allocation pass to operate on pseudos directly,
but also strengthens several other optimization passes, such as CSE,
loop optimizer and trivial dead code remover.</p>

<p>While future SSA passes will have same effect on the low-level RTL
generation, we still believe this pass to be useful right before the
register allocation pass or after the code duplication passes that
need to unshare temporaries, such as tracer or loop unroller.</p>

<h4>Implementation in GCC</h4>

<p>The implementation of the web pass uses Michael Hayes's dataflow
module to construct du-chains and unionfind to produce webs.</p>

<h4>Status</h4>

<p>The code is on the branch.</p>

<p>This pass improves SPECint2000 performance by roughly 1%. We need
to work on updating debug information in this pass.</p>

<h3>Register coalescing Pass</h3>

<p>This pass coalesces multiple registers into single one in order to
avoid register to register copies that our register allocator is not
able to deal with very well.  It is a kind of temporary solution until
the new register allocator is in place. At that point, the benefits are
questionable, but still it is more effective (and probably less
expensive) than the current copy propagation implementation.</p>

<h4>Implementation in GCC</h4>

<p>It is designed as stand alone pass constructing conflict graph and
coalescing registers run after GCSE.</p>

<h4>Status</h4>

<p>The code is on the branch.</p>

<p>This pass improves some of SPECint2000 tests and degrades others basically
because of lack of live range splitting and increased register pressure in
final program.  It helps by about 1 point on peak SPECint runs.</p>

<h3>Jump Combining Pass</h3>

<p>This work is done by <a href="mailto:jh@suse.cz">Jan Hubicka</a>.</p>

<h4>Theory</h4>

<p>This optimization is very close to instruction combining. The
compiler discovers conditionals with common expression like patterns
<code>(a || b)</code> and <code>(a &amp;&amp; b)</code> and attempts
to create compound conditions.</p>

<p>When using condition registers arithmetics this is always possible,
but not always profitable for all architectures. Trivial, still
common, cases can be handled.  Some examples for these conversions
are: <code>(a || b)</code> can be converted to <code>(a | b)</code>,
<code>(a == b || c == d)</code> to <code>(a^b)|(c^d)</code> and,
<code>(a == 3 || a == 4)</code> to <code>((unsigned)(a - 3) &lt;
2)</code>.</p>

<h4>Implementation in GCC</h4>

<p>This is implemented as part of the if-conversion pass.  It
basically finds the patterns of two jumps in the CFG that can be
combined and then attempts to to apply several simplifications.
Profile data is used to control the amount of optimizations.</p>

<h4>Status</h4>

<p>The code is on the branch. Benefits for Athlon CPU are about 0.5%.
The code is being reviewed for inclusion by Richard Henderson.</p>

<h3>Thread Safe Profiling</h3>

<p>This work is done by <a
href="mailto:rakdver@atrey.karlin.mff.cuni.cz">Zdenek Dvorak</a>.</p>

<h4>Theory</h4>

<p>GCC currently does not support profiling of threads.</p>

<h4>Implementation in GCC</h4>

<p>This is only needed for machines that have no atomic adds and for
SMP machines.  We currently plan to do this with per-thread counters.
At the start of each function, a function from GCC's runtime
environment will be called that returns the pointer.</p>

<h4>Status</h4>

<p>A first version has been written and is installed to the branch.
The merging to mainline is problematic and the code may be dropped</p>



<h4>Procedure splitting</h4>

<p>To shorten branches and avoid pollution it is desirable to put
frequently executed procedures together and split their infrequently
executed parts.</p>

<h4>Status</h4>

<p>We don't split the function body, but we do categorize functions
into different parts.
A first version has been written and has been installed on the branch.
It adds about 1% to spec2000 runs with profile feedback.</p>

<h3>MIDlevel RTL</h3>

<p>This work is done by <a
href="mailto:jh@suse.cz">Jan Hubicka</a>.</p>

<h4>Theory</h4>

<p>The current RTL representation is too low level for a number of
optimizations that we want to implement.  This makes the
implementation difficult and even ineffective in some cases.  MidRTL
will be essentially a RTL representation of a program lacking the tons
of architectural details making the transformations of instruction
chain much easier and therefore being the basis for new optimization
passes.</p>

<h4>Implementation in GCC</h4>

<p>The following outline has been written by Jeff Law:</p>
<pre>
Particularly for the task of building the machine description for the generic
RTL and the translation from generic RTL into target specific RTL.  Those two
(closely related) tasks are the biggest hunks of infrastructure we need.

Conceptually what we want is a md file which describes the set of generic
RTL patterns.  We're not precisely sure on what predicates to use for
operands, so invent something like "generic_operand" which we will more
precisely define later.  Make all operands use "generic_operand".  At
least initially accept MEMs, REGs and all constants.

The only "tricky" part of "generic_operand" is memory addresses; at least
at this stage, generic_operand should accept a MEM with a "generic_address".
A "generic address" should be any "generic operand", except another MEM.
(ie we don't allow (MEM (MEM)) at this stage.  We will likely refine this
later.

Once you've got a rough framework in place, you have to build out some
infrastructure for lowering.  Specifically we'll need the ability to have
two recognizers.  Early in compilation we'll only recognize the generic
patterns.  When we start lowering we want to recognize the target specific
expanders, patterns, splitters, etc.  So you'll probably have to hack up
genrecog to build two insn-recog files with different prefixes for anything
with external scope and some infrastructure to switch between them based
on a state variable which indicates we've started the lowering process.

For the actual lowering process, we want to avoid twiddling the existing
back ends as much as possible.  If I recall, the general idea Richard and
I came up with was:

For each insn:

  Try to see if it matches a pattern in the target's md file.  If it does,
  then you're done.  Similarly, if it matches a splitter, then run the
  splitter, match the resulting insns in the target md file and your done.

  Else generic.md should call into optabs.c to expand operations -- this will
  in large part allow target specific lowering using existing mechanisms.
  (ie, lowering of operands, running target expanders, etc etc).  It's likely
  we'll need to refine this some, but this is the general idea.

</pre>

<h4>Status</h4>

<p>A first version bootstraps and will be hopefully included in the branch
soon.  It is far from being complete thought.</p>

<h3>Miscellaneous Patches</h3>

<p>Achievements that have been merged into mainline already:</p>

<dl>
<dt>Easier to use CFG routines</dt>

<dd>Before branching, we reorganized and cleaned up the CFG routines
to make them easier to use.  Each RTL instruction now contains a
pointer to its basic block.  It is therefore easy to adjust basic
block boundary notes when instructions are reordered.</dd>

<dt><a href="../news/profiledriven.html">Profile
based optimization infrastructure</a></dt>
<dd></dd>
</dl>

<p>Patches that are waiting for review and integration into
mainline:</p>

<ul> <li>New code layout functions (file <code>cfglayout.c</code>)
based on the basic block reordering pass and capable of code
duplication.</li>

<li>A fix to loop discovery code, by Josef Zlomek.</li>

<li>CFG based jump threading.</li>

<li>Cross-jumping for computed jumps and trapping instructions</li>
</ul>

<h2>Subprojects already merged into mainline tree in 3.2 development period</h2>

<h3>Tracer / Superblock Formation</h3>

<h4>Theory</h4>

<p>The <em>tracer</em> also known as <em>tail duplication</em> or
<em>superblock formation</em> pass identifies commonly executed
sequences of basic blocks, so called <em>traces</em> and duplicates
the blocks in order to avoid side entrances to those sequences -
forming so-called <em>superblocks</em>.  The superblocks are
easier to optimize by conventional passes such as CSE and scheduler.
In case no extra optimizations apply, our cross-jumping pass merges the
code paths again after register allocation.</p>

<h4>Implementation in GCC</h4>

<p>We plan to do loop peeling and superblock formation in single pass
as described in <a href="#ref3">[3]</a>.</p>
 
<p>Enhancements that are currently only in the cfg-branch:</p>
<ul>
<li>Avoid rebuilding the  CFG for <code>toplev.c</code>.</li>

<li>Opcode heuristics tweak:
Avoid predicting comparisons for equality to 0 and floating point comparisons
to increase hitrate of opcode heuristic.</li>
</ul>

<h3>High Level Branch Prediction</h3>

<p>This work is done by <a
href="mailto:rakdver@atrey.karlin.mff.cuni.cz">Zdenek Dvorak</a>.</p>

<h4>Theory</h4>

<p>This is an infrastructure enhancement to propagate predictions from
the trees representation to the RTL-level.</p>

<h4>Implementation in GCC</h4>

<h3>Flexible and Safe GCOV Format</h3>

<p>This work is done by <a
href="mailto:bim@atrey.karlin.mff.cuni.cz">Pavel Nejedly</a>.</p>

<h4>Theory</h4>

<p>Currently the profile is fragile, since there is no verification
that the compiled program matches the profiled data.  Since the
profiler eliminates the redundancy in data, the mismatch is often not
discovered at all causing completely nonsensical results.
nonsense.</p>

<h4>Implementation in GCC</h4>

<p>We plan to calculate a checksum (CRC) of each graph when it is
constructed.  Also the format will be extended so that more
information can be recorded, such as histograms for the number of
iterations for loops that can then be used for loop peeling and loop
unrolling.</p>

<p>We also want to add versioning and further information to make it
look like a real file format.</p>

<p>There are a few references to much more advanced profiling systems
in <a href="#ref3">[3]</a>.</p>
<p>
This is done using <code>NOTE_INSN_PREDICTION</code> emitted in the
stream converted to <code>REG_PREDICTION</code> later.  For instance,
the compiler can now predict branches using <code>goto</code> as
infrequent.  It also allows to handle and specially predict constructs
such as return of (negative) constants, out of range code and, exception
handling where the front end knows that those are infrequent, but the
back end has no information about it.</p>

<p>We can also use it to preserve more information about high-level
loops.  For instance loop a with continue statement looks like two
nested loops to the back end, but the optimizer may be able to estimate
that the internal (continue) loop is not that frequent.</p>


<h2>Branch Status</h2>

<p>The branch was created from the development mainline on 12 November
2001.  Its CVS tag is called <code>cfg-branch</code>.</p>

<h2>Contributing</h2>

<p>Check out the <code>cfg-branch</code> branch with the command</p>

<pre>
cvs co -r cfg-branch gcc
</pre>

<p>then configure and build in the normal way.</p>

<p>Please post patches <a href="../contribute.html">in the usual way</a>
to the gcc-patches list, marked <code>[cfg-branch]</code>. As this is
a branch the usual maintainer rules do not apply. This branch is
maintained by <a href="mailto:jh@suse.cz">Jan Hubicka
&lt;jh@suse.cz&gt;</a>. Approval from the usual maintainers will be
needed when submitting patches from the branch for consideration for
the mainline.</p>

<h3>Readings</h3>

<p>Lots of useful information is present at the <a
href="http://impact.crhc.illinois.edu/">IMPACT compiler group</a>
homepage. Some other papers:</p>

<dl>
<dt>[1]</dt>

<dd><a href="https://doi.org/10.1145/155090.155119">Branch
Prediction for Free; Ball and Larus; PLDI '93.</a></dd>

<dt>[2]</dt>

<dd><a href="https://doi.org/10.1145/192724.192725">Static
Branch Frequency and Program Profile Analysis; Wu and Larus;
MICRO-27.</a></dd>

<dt id="ref3">[3]</dt>

<dd>Design and Analysis of Profile-Based Optimization in Compaq's
Compilation Tools for Alpha; Journal of Instruction-Level Parallelism 3
(2000) 1-25.</dd>

<dt>[4]</dt>

<dd><a href=
"https://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf">Accurate
Static Branch Prediction by Value Range Propagation; Jason R. C.
Patterson (jasonp@fit.qut.edu.au), 1995</a></dd>

<dt>[5]</dt>

<dd><a href="https://doi.org/10.1145/258916.258932">Near-optimal
Intraprocedural Branch Alignment; Cliff Young, David S. Johnson,
David R. Karger, Michael D. Smith, ACM 1997</a></dd>

<dt id="ref6">[6]</dt>

<dd><a href="https://doi.org/10.1145/305138.305178">Software
Trace Cache; International Conference on Supercomputing, 1999</a></dd>

<dt>[7]</dt>

<dd><a href="https://doi.org/10.1002/spe.4380211204">Using
Profile Information to Assist Classic Code Optimizations; Pohua P.
Chang, Scott A. Mahlke, and Wen-mei W. Hwu, 1991</a></dd>

<dt>[8]</dt>

<dd><a href=
"http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.1922">Hyperblock
Performance Optimizations For ILP Processors; David Isaac August,
1996 (Master Thesis)</a></dd>

<dt>[9]</dt>

<dd><a href="https://doi.org/10.1145/173262.155118">Reverse
If-Conversion; Nancy J. Warter, Scott A. Mahlke, Wen-mei W. Hwu, B.
Ramakrishna Rau; ACM SIGPLAN Notices, 1993</a></dd>
</dl>

</body>
</html>
